Basic R

Curro Campuzano

An opinionated introduction to R

by the Svardal lab, based on material by Alexandros Bantounas and contributions by Curro Campuzano.

What is R?

R is a programming language for statistical computing and data visualization.

It has a very rich ecosystem of packages (i.e. collections of pre-written code) to perform statistical and genomic analysis.

Work with R

  • You can enter interactive mode by executing R in the command line.
  • You can run a script by executing Rscript script.R > output.txt in the command line.
  • You can use an IDE to write and execute code, for example RStudio.
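
As a minimal sketch of what such a script could contain (the contents below are illustrative, not part of the course material): anything printed with cat() or print() goes to standard output, which is what the > redirection writes to output.txt.

```r
# Contents of a hypothetical script.R
vals <- c(1, 2, 5, 1, 2)

# Printed text goes to standard output, so `> output.txt` captures it
cat("Number of values:", length(vals), "\n")
cat("Mean:", mean(vals), "\n")
```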

Troubleshooting R

If you can’t install R:

Has anyone had any problems?

Typical R workflow

  1. Define/create a folder to be used as the working directory.

  2. Open RStudio and create a new script file (menu). You can also create a project (button top right).

  3. Set the working directory to your prepared folder.

  4. Write your script in the script window and save it. Send selected code line(s) to the console using ctrl+Return (PC).

  5. Conduct analyses, save the script, outputs, and graphs. When the entire analysis is ready, you can compile code and output into a notebook.
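
Steps 1 and 3 can also be done from the console. The sketch below assumes a made-up folder name (r_intro_project) inside a temporary directory so that it is safe to run anywhere; in practice you would point to your own project folder.

```r
# Hypothetical folder; tempdir() is used only to keep this sketch harmless
project_dir <- file.path(tempdir(), "r_intro_project")
dir.create(project_dir, showWarnings = FALSE, recursive = TRUE)

# setwd() changes the working directory and returns the previous one
old_wd <- setwd(project_dir)
getwd() # confirm where R now reads and writes files

# Relative paths are resolved against the working directory
writeLines("analysis notes", "notes.txt")
file.exists("notes.txt")

# Restore the previous working directory
setwd(old_wd)
```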

Basic syntax: Operators

# You can comment your code by starting a line with `#`
# 1 + 1
# You have basic operators such as
1 + 1 / 2
[1] 1.5
1 != 1
[1] FALSE
(1 + 1) > 3
[1] FALSE
# You can assign to variables using `<-`, `->`, or `=`
a <- 1234
1 + 1 -> b
c <- "abcd" # Strings can be created using quotes
d <- TRUE
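
A few more operators that come up often, as a small sketch using only base R. The last lines illustrate why comparing decimal numbers with == can surprise you; all.equal() compares with a tolerance instead.

```r
int_div <- 7 %/% 2 # integer division: 3
remainder <- 7 %% 2 # remainder (modulo): 1
power <- 2^10 # exponentiation: 1024

both <- TRUE & FALSE # logical AND: FALSE
either <- TRUE | FALSE # logical OR: TRUE

# Floating point numbers are stored approximately
x <- 0.1 + 0.2
exact <- x == 0.3 # FALSE
close_enough <- isTRUE(all.equal(x, 0.3)) # TRUE
```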

Key objects: atomic vectors and lists

In R, almost everything is an atomic vector, a list, or a function.

# All elements of an atomic vector have the same type
c(1, 2, 3) # Numeric vector
[1] 1 2 3
c(TRUE, FALSE, TRUE) # Logical vector
[1]  TRUE FALSE  TRUE
c("A", "C", "T", "G") # Character vector
[1] "A" "C" "T" "G"
# Lists can have different types of items in different components
mylist <- list(1, 2, "A")
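
The difference becomes visible when you mix types. In this small sketch (the variable name mixed is illustrative), c() silently converts everything to a common type, while a list keeps each component as it is.

```r
# Mixing types in c() coerces all elements to character
mixed <- c(1, "A", TRUE)
class(mixed) # "character"

# A list keeps the original type of each component
mylist <- list(1, 2, "A")
class(mylist[[1]]) # "numeric"
class(mylist[[3]]) # "character"

# `[[` extracts one component, `[` returns a shorter list
class(mylist[3]) # "list"
```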

Key objects: using functions

Functions in R are reusable blocks of code that take inputs (arguments), perform a specific task, and return an output.

vals <- c(1, 2, 5, 1, 2)
c(vals, 13)
[1]  1  2  5  1  2 13
max(vals) # Max
[1] 5
which.min(vals) # Argmin
[1] 1
# Random values from a Uniform(a, b)
vals <- runif(n = 100, min = 0, max = 10)
summary(vals)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2222  3.1831  5.9254  5.3583  7.4306  9.9966 
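
Two details worth knowing when calling functions, shown as a short sketch: arguments can be matched by name or by position, and random draws such as the ones above change on every run unless you set a seed first. The seed value 2024 is arbitrary.

```r
# Arguments can be matched by name or by position
by_name <- round(3.14159, digits = 2)
by_position <- round(3.14159, 2)
identical(by_name, by_position) # TRUE

# Setting the same seed before each draw gives the same random numbers
set.seed(2024)
first_draw <- runif(n = 3, min = 0, max = 10)
set.seed(2024)
second_draw <- runif(n = 3, min = 0, max = 10)
identical(first_draw, second_draw) # TRUE
```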

Key objects: writing functions

Making your own functions allows you to automate common tasks in a more powerful way than copy-pasting.

z_score <- function(x) {
    (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
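
A quick usage sketch with made-up values (heights is not part of the course data). Because of na.rm = TRUE, the missing value is ignored when computing the mean and standard deviation, but it stays missing in the result.

```r
z_score <- function(x) {
    (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Made-up measurements with one missing value
heights <- c(150, 160, 170, NA)

# Mean is 160 and standard deviation is 10 once NA is ignored
z <- z_score(heights)
z # -1 0 1 NA
```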

Loops in R

Loops allow us to repeat the same code for each element of a set of inputs. The main loop used in this tutorial is the for loop:

for (j in 1:5) {
    print(j^2)
}
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
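
Often you want to keep the results rather than print them. One possible pattern, sketched below, is to create the output vector first and fill it inside the loop. Many simple loops can also be replaced by a vectorised expression or by vapply(); the variable names are illustrative.

```r
# Pre-allocate the output, then fill it element by element
squares <- numeric(5)
for (j in seq_along(squares)) {
    squares[j] <- j^2
}
squares # 1 4 9 16 25

# The same result without an explicit loop
squares_vectorised <- (1:5)^2
squares_vapply <- vapply(1:5, function(j) j^2, numeric(1))
```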

Packages

R packages are sets of custom functions and object classes that can be installed and used. Most R packages are deposited in the CRAN repository.

# Install packages from CRAN
install.packages("tidyverse")
# Install packages from Bioconductor
if (!require("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("ggtree") # The package name must be a quoted string

The tidyverse

The R language has evolved quite a lot since it was created. A “modern” style of writing R code is promoted by the tidyverse package.

library(tidyverse)

The syntax library(package_name) attaches a package to your active session, so you can call its functions by name without the package_name:: prefix.
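
To illustrate the two ways of reaching a function, here is a small sketch that uses the stats package, which ships with R, so nothing needs to be installed. The same pattern applies to any installed package.

```r
# package::function works even if the package is not attached
med <- stats::median(c(1, 3, 10))
med # 3

# library() adds the package to the search path
library(stats)
attached <- "package:stats" %in% search()
attached # TRUE

# Once attached, the prefix is optional
median(c(1, 3, 10))
```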

Loading data

Often, you want to load data generated outside your R session (by others or a genomics pipeline). Tables are encoded as data frames, which are lists of equal-length vectors.

# Raw data URL (a local path would also work)
uri_adelie <- "https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.219.3&entityid=002f3893385f710df69eeebe893144ff"
df <- read_csv(uri_adelie)
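
To see what "lists of equal-length vectors" means without downloading anything, here is a sketch with a made-up table (toy and its values are illustrative). It uses base R write.csv() and read.csv() instead of readr's read_csv(), so that it runs with no extra packages.

```r
# A data frame is built from vectors of the same length
toy <- data.frame(
    island = c("Dream", "Biscoe", "Torgersen"),
    body_mass_g = c(3750, 3800, NA)
)
is.list(toy) # TRUE
lengths(toy) # each column has 3 elements

# Round trip through a CSV file in a temporary location
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)
toy_reloaded <- read.csv(path)
dim(toy_reloaded) # 3 rows, 2 columns
```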

Inspecting the data

dim(df) # This shows the dimension of the dataframe
[1] 152  17
colnames(df) # The names of the columns
 [1] "studyName"           "Sample Number"       "Species"            
 [4] "Region"              "Island"              "Stage"              
 [7] "Individual ID"       "Clutch Completion"   "Date Egg"           
[10] "Culmen Length (mm)"  "Culmen Depth (mm)"   "Flipper Length (mm)"
[13] "Body Mass (g)"       "Sex"                 "Delta 15 N (o/oo)"  
[16] "Delta 13 C (o/oo)"   "Comments"           
glimpse(df) # Quick overview
Rows: 152
Columns: 17
$ studyName             <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL…
$ `Sample Number`       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
$ Species               <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie P…
$ Region                <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers"…
$ Island                <chr> "Torgersen", "Torgersen", "Torgersen", "Torgerse…
$ Stage                 <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adu…
$ `Individual ID`       <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", …
$ `Clutch Completion`   <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", …
$ `Date Egg`            <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16,…
$ `Culmen Length (mm)`  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34…
$ `Culmen Depth (mm)`   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18…
$ `Flipper Length (mm)` <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190,…
$ `Body Mass (g)`       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 34…
$ Sex                   <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE"…
$ `Delta 15 N (o/oo)`   <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18…
$ `Delta 13 C (o/oo)`   <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.298…
$ Comments              <chr> "Not enough blood for isotopes.", NA, NA, "Adult…

Selecting and subsetting

x <- df[1, ] # Accessing the first row.
x <- df[, 1] # Accessing the first column.
df[, "studyName"] # Accessing the column by name
# A tibble: 152 × 1
   studyName
   <chr>    
 1 PAL0708  
 2 PAL0708  
 3 PAL0708  
 4 PAL0708  
 5 PAL0708  
 6 PAL0708  
 7 PAL0708  
 8 PAL0708  
 9 PAL0708  
10 PAL0708  
# ℹ 142 more rows
df[1, "Species"] # Accessing the first row of the column "Species"
# A tibble: 1 × 1
  Species                            
  <chr>                              
1 Adelie Penguin (Pygoscelis adeliae)
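
Note that df is a tibble, so `[` always returns another tibble. To pull out a column as a plain vector you can use `[[` or `$`, and a logical condition can be used to select rows. A minimal sketch with a made-up base data frame (toy and its values are illustrative):

```r
toy <- data.frame(
    Island = c("Dream", "Biscoe", "Dream"),
    mass_g = c(3750, 3800, 3250)
)

# `[[` and `$` return the column as a vector
masses <- toy[["mass_g"]]
islands <- toy$Island
mean(masses) # 3600

# A logical vector in the row position keeps the rows where it is TRUE
dream <- toy[toy$Island == "Dream", ]
nrow(dream) # 2
```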

Advanced dataset manipulation

For more advanced data manipulations, you can use functions from the dplyr package and chain operations by passing the output of one function as input to the next one using the %>% pipe operator.

c(1, 2, NA, 5) %>%
    sum() %>%
    as.character()
[1] NA
c(1, 2, NA, 5) %>%
    sum(na.rm = TRUE) %>%
    as.character()
[1] "8"
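
For comparison, the same chain can be written with the native pipe |>, which needs no package but assumes R 4.1 or newer. The sketch also shows the nested call that both pipes are shorthand for.

```r
# Native pipe (assumes R >= 4.1)
piped <- c(1, 2, NA, 5) |>
    sum(na.rm = TRUE) |>
    as.character()

# Equivalent nested call, read from the inside out
nested <- as.character(sum(c(1, 2, NA, 5), na.rm = TRUE))

identical(piped, nested) # TRUE
```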

Example of dplyr manipulation

Can you guess exactly what is happening?

df %>%
    mutate(Sex = tolower(Sex)) %>%
    filter(Sex == "female") %>%
    filter(Island %in% c("Torgersen", "Biscoe", "Dream")) %>%
    filter(!is.na(Stage)) %>%
    select("Island", starts_with("Culmen")) %>%
    slice_sample(n = 5)
# A tibble: 5 × 3
  Island    `Culmen Length (mm)` `Culmen Depth (mm)`
  <chr>                    <dbl>               <dbl>
1 Dream                     38.1                18.6
2 Torgersen                 39                  17.1
3 Biscoe                    39.6                17.7
4 Torgersen                 35.9                16.6
5 Dream                     36.8                18.5

Example of dplyr manipulation

df2 <- df %>%
    # mutate() creates new columns or modifies existing ones
    mutate(Sex = tolower(Sex)) %>%
    # filter() rows by the value of a column
    filter(Sex == "female") %>%
    # filter() rows by a set of allowed values
    filter(Island %in% c("Torgersen", "Biscoe", "Dream")) %>%
    # filter() out rows with missing values
    filter(!is.na(Stage)) %>%
    # select() certain columns by index, name or pattern
    select("Island", starts_with("Culmen")) %>%
    # Take a random sample of rows
    slice_sample(n = 5)

Plotting using base R

In base R, there are many convenient plots that just “work” when you attempt to plot different objects. However, for final plots, they are not always the most convenient option.

Plot in base R example (1/2)

cols <- c("Culmen Length (mm)", "Flipper Length (mm)", "Body Mass (g)")
plot(df[, cols])

Plot in base R example (2/2)

hist(df[["Culmen Length (mm)"]],
    main = "Distribution of Culmen Length",
    xlab = "Culmen Length (mm)",
    col = "skyblue",
    border = "black"
)
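
To save a base R plot to a file, one option is to open a graphics device, draw the plot, and close the device. The sketch below uses simulated values instead of the penguin data and a made-up file name in a temporary directory, so it runs on its own.

```r
# Simulated measurements standing in for the real column
set.seed(1)
culmen <- rnorm(100, mean = 39, sd = 3)

# Hypothetical output file in a temporary location
out_file <- file.path(tempdir(), "culmen_hist.pdf")

pdf(out_file, width = 6, height = 4) # open a PDF device (size in inches)
hist(culmen,
    main = "Distribution of Culmen Length (simulated)",
    xlab = "Culmen Length (mm)",
    col = "skyblue",
    border = "black"
)
dev.off() # close the device so the file is written

file.exists(out_file)
```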

Plotting using ggplot

# Set a theme
theme_set(ggthemes::theme_tufte())
df %>%
    # Discard individuals with unknown sex
    filter(!is.na(Sex)) %>%
    # Create a plot with body mass on the x-axis and fill by Sex
    ggplot(aes(x = `Body Mass (g)`, fill = Sex)) +
    # Plot a histogram
    geom_histogram(color = "black", alpha = 0.8) +
    ylab("Count") + # Y-axis label (the x-axis shows body mass)
    ggtitle("Histogram example") # Add title

Plotting using ggplot

df %>%
    filter(!is.na(Sex)) %>%
    ggplot(aes(x = `Flipper Length (mm)`, y = `Body Mass (g)`, colour = Sex)) +
    geom_smooth(method = lm, linetype = "dashed") +
    geom_point(shape = 1) +
    facet_wrap(~Island) +
    ggtitle(
        label = "Flipper Length versus Body Mass",
        subtitle = "In Adelie penguins, we see differences between sexes but not between islands"
    )
